Abstract. We consider the problem of distinguishing between two arbitrary
black-box distributions defined over the domain [n], given access to
s samples from both. It is known that in the worst case $O(n^{2/3})$ samples
are both necessary and sufficient, provided that the distributions have $L_1$
difference of at least $\epsilon$. However, it is also known that in many cases
fewer samples suffice. We identify a new parameter that precisely controls the
number of samples needed, and present an algorithm whose sample complexity
depends only on this parameter and is independent of the domain size. Also,
for a large subclass of distributions we provide a lower bound that matches
our algorithm up to polylogarithmic factors.

1 Introduction 

One of the most fundamental challenges facing modern data analysis is to
understand and infer hidden properties of the data being observed. The property
testing framework [9,6,2,1,3] has recently emerged as a tool for testing, with
high probability and using only a few queries, whether a given data set has a
certain property. One problem that commonly arises in applications is to test
whether several sources of random samples follow the same distribution or are
far apart, and this is the problem we study in this paper. More specifically,
suppose we are given a black box that generates independent samples of a
distribution P over [n], a black box that generates independent samples of a
distribution Q over [n], and finally a black box that generates independent
samples of a distribution T, which is either P or Q. How many samples do we
need to decide whether T is identical to P or to Q? This problem arises
regularly in change detection [S], in testing whether a Markov chain is
rapidly mixing [2], and in other contexts.

Our Contribution Our results generalize those of Batu et al. [2], who showed
that there exists a pair of distributions P, Q on domain [n] with a large
statistical difference $\|P - Q\|_1 > 1/2$ such that no algorithm can tell apart
the case (P, P) from (P, Q) with $o(n^{2/3})$ samples. They also provided an
algorithm that nearly matches the lower bound for a specific pair of
distributions.

In the present paper, instead of analyzing the "hardest" pair of distributions,
we characterize the property that controls the number of samples one needs to
tell a pair of distributions apart. This characterization allows us to factor
out the dependency on the domain size. Namely, for every two distributions P
and Q that satisfy certain technical properties that we describe below, we
establish both an algorithm and a lower bound, such that the number of samples
both necessary and sufficient is
$\tilde{\Theta}\left(\frac{\|P+Q\|_2}{\|P-Q\|_2^2}\right)$. For the lower bound
example of [2], this quantity amounts to $n^{2/3}$, and the distributions in [2]
satisfy the needed technical properties; thus our results generalize theirs.
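As a hedged illustration, the controlling quantity can be computed directly from two explicit probability vectors. The helper name `sample_complexity_parameter` is hypothetical, and the sketch assumes the parameter is $\|P+Q\|_2 / \|P-Q\|_2^2$; constants and polylogarithmic factors are ignored.

```python
import math


def sample_complexity_parameter(p, q):
    """Return ||P+Q||_2 / ||P-Q||_2^2 for two probability vectors.

    Hypothetical helper: p and q are lists of probabilities over the
    same domain. Returns math.inf when the vectors are identical.
    """
    if len(p) != len(q):
        raise ValueError("p and q must be defined over the same domain")
    # 2-norm of the sum of the two vectors.
    norm_sum = math.sqrt(sum((pi + qi) ** 2 for pi, qi in zip(p, q)))
    # Squared 2-norm of the difference.
    sq_norm_diff = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    if sq_norm_diff == 0:
        return math.inf  # identical vectors: the ratio is unbounded
    return norm_sum / sq_norm_diff
```

For two uniform distributions on disjoint halves of a four-element domain the ratio is a constant, which matches the intuition that such a pair is easy to tell apart.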

From a practical perspective, such a characterization is important because the
high-level properties of the distributions may be learned empirically (for
instance, it might be known that the potential class of distributions is a
power law), and our results then allow one to significantly reduce the number
of samples needed for testing.

In many respects, our results complement those of Valiant [10]. There it was
shown that for testing symmetric and continuous properties, it is both
necessary and sufficient to consider only the high-frequency elements. In
contrast, we show that for our problem the low-frequency elements provide all
the necessary information. This is quite surprising given that low-frequency
elements carry no useful information for continuous properties. For the
computability part, [10] introduces the canonical tester, an exponential-time
algorithm that finds all feasible underlying inputs that could have produced
the output. If the property value is consistent across all checked inputs, it
reports that value, and otherwise it gives a random answer. In contrast, our
algorithm guarantees that for any underlying pair of distributions, the
algorithm after observing the sample will be correct with high probability,
even though there might be a valid input that would cause the algorithm to
fail.

Finally, we develop a new technique that allows tight concentration bounds 
analysis of heterogeneous balls and bins problem that might be of independent 
interest. 

Paper Overview In the next section we describe our problem in more detail,
connect it to the closeness problem studied in earlier work, and state our
results. Section 4 proves useful concentration bounds and introduces the main
technique that we use for the algorithm analysis. Section 5 provides the
algorithm and its analysis, and finally in Section 6 we prove our lower bounds.

2 Problem Formulation 

We consider arbitrary distributions over the domain [n] . We assume that the only 
way of interacting with a distribution is through a blackbox sampling mechanism. 
The main problem we consider is as follows: 

Problem 1 (Distinguishability problem). Given a "training phase" of s samples
from X and s samples from Y, and a "testing phase" consisting of a sample of
size s from either X or Y, output whether the first or the second distribution
generated the testing sample.
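The setup of Problem 1 can be sketched as a small test harness. This is only an illustration under stated assumptions: distributions are given as explicit probability vectors over {0, ..., n-1}, and the names `draw` and `make_instance` are hypothetical.

```python
import random


def draw(dist, size, rng):
    """Draw `size` independent samples from a probability vector."""
    return rng.choices(range(len(dist)), weights=dist, k=size)


def make_instance(x, y, s, rng):
    """Build one instance of the distinguishability problem.

    Returns the training and testing samples together with the hidden
    ground truth ("first" or "second"), which a distinguisher must guess.
    """
    truth = rng.choice(["first", "second"])
    source = x if truth == "first" else y
    instance = {
        "train_x": draw(x, s, rng),   # training phase, s samples from X
        "train_y": draw(y, s, rng),   # training phase, s samples from Y
        "test": draw(source, s, rng), # testing phase, s samples from X or Y
    }
    return instance, truth
```

A candidate algorithm would receive only `instance` and be scored against `truth` over many independent instances.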



We say that an algorithm solves the distinguishability problem for a class of
distribution pairs $\mathcal{P}$ with s samples if, for any $(X, Y) \in \mathcal{P}$, with probability at
least 1 — P°'-y^^s{s) outputs correct answer. Further, if X and Y are identical, 
it outputs first or second with probability at least 0.5 — ESM^^ilL^ 

We show that the distinguishability problem is equivalent to the following
problem studied in [2,10]:

Problem 2 (Closeness Problem). Given s samples from X and s samples from
Y, decide whether X and Y are almost identical or far apart with acceptable 

, , polyloq(s) 

error at most - — .i ^ ' . 

An algorithm solves the closeness problem for a class of distribution pairs $\mathcal{P}$
if for every pair $(X, Y) \in \mathcal{P}$ it outputs "different" with probability at least
1 — HHMh^M^ for every input of the form {X,X) it outputs "same" with 
probability at least 1 — ^ 

Our first observation is that if either of the problems can be solved for a
certain class of distribution pairs, then the other can be solved with at most
logarithmically more samples and time. The following lemma formalizes this
statement; due to space limitations the proof is deferred to the appendix.

Lemma 1. If there is an algorithm that solves the distinguishability problem
for a class of distribution pairs $\mathcal{P}$ with s samples, then there is an algorithm
that solves the identity problem for the class $\mathcal{P}$ with at most O(s log s) samples.

If there is an algorithm that solves the closeness problem for the class $\mathcal{P}$ with
at most s samples, then there is an algorithm that solves the
distinguishability problem with at most s samples.

3 Results 

Our algorithmic results can be described as follows: 

Theorem 1. Consider a class of distribution pairs such that $\frac{\|P - Q\|_2^2}{\|P + Q\|_2} \ge \alpha$, and

let s = 60(1 logQ;|)''/^/Q;, and each pi and qi is at most then Algorithm\^pro- 
duces correct answer with probability at least 1 — c/s^, where c is some universal 
constant. 

Essentially, the theorem above states that $\frac{\|P+Q\|_2}{\|P-Q\|_2^2}$ controls the
distinguishability of a distribution pair. There are several interesting
cases: if $\|P - Q\|_2$ is comparable to either $\|P\|_2$ or $\|Q\|_2$, then the result
says that $s \sim 1/\|P - Q\|_2$ suffices. This generalizes the observation from [2]
that if the $L_2$ difference is large then a constant number of samples suffices.

We also note that the upper bound on each $p_i$ and $q_i$ is a technical
condition that guarantees that any fixed element has expectation of appearance
of at most 1/2. In other words, no element can be expected to be seen in the
sample with high probability. This requirement is particularly striking, given
that the results of [10] say that elements that have low expectation are
provably not useful when testing continuous properties. Further exploiting
this contrast is a very interesting open direction.

For the lower bound part, our results apply to a special class of distributions
that we call weakly disjoint distributions.

Definition 1. Distributions P and Q are weakly disjoint if every element x
satisfies one of the following:

(1) $p_x = q_x$, (2) $p_x > 0$ and $q_x = 0$, (3) $q_x > 0$ and $p_x = 0$.

We denote the set of elements such that $p_x = q_x$ by $\mathcal{C}(P, Q)$, and the rest are
denoted by $\mathcal{D}(P, Q)$.
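Definition 1 translates into a direct element-by-element check. The sketch below is illustrative only; the function name `split_weakly_disjoint` and the optional tolerance `tol` are assumptions, and elements with zero probability under both distributions are placed in the common set.

```python
def split_weakly_disjoint(p, q, tol=0.0):
    """Return (common, disjoint) index lists, or None if the pair
    of probability vectors is not weakly disjoint.

    common   : elements with p_x == q_x (up to `tol`)
    disjoint : elements in the support of exactly one distribution
    """
    common, disjoint = [], []
    for x, (px, qx) in enumerate(zip(p, q)):
        if abs(px - qx) <= tol:
            common.append(x)            # case (1): equal probabilities
        elif (px > 0 and qx == 0) or (qx > 0 and px == 0):
            disjoint.append(x)          # cases (2) and (3)
        else:
            return None                 # overlapping but unequal mass
    return common, disjoint
```

For example, two distributions that agree on part of the domain and place their remaining mass on different elements pass the check, while a pair that assigns unequal positive mass to the same element does not.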

It is worth noting that the known worst case examples of [2] belong to this 
class. We conjecture that weakly disjoint distributions represent the hardest case 
for lower bound, and all other distributions need fewer samples, and that the 
result above generalizes to all distributions. 

For the rest of the paper we concentrate on the distinguishability problem,
but through Lemma 1 the results immediately apply to the closeness problem.
We now formulate our lower bound results:

Theorem 2. Let P and Q be weakly disjoint distributions, and let

where c is some universal constant. No algorithm can solve the
distinguishability problem for the class of distribution pairs defined by
arbitrary permutations of (P, Q), i.e., $\mathcal{P} = \{(\pi P, \pi Q)\}$.

The first, second and third constraints in the min expression above are
technical assumptions. In fact, for many distributions, including the
worst-case scenario of [2], the remaining term is much smaller than
$\min\{1/\|P - Q\|_3, 1/\|P\|_\infty, 1/\|Q\|_\infty\}$, and hence those constraints
can be dropped.



4 Preliminaries 



4.1 Distinguishing between sequences of Bernoulli random variables

Consider two sequences of Bernoulli random variables $x_i$ and $y_i$. In this
section we characterize when the sign of the observed difference
$\sum_i x_i - \sum_i y_i$ can be expected to be the same as the sign of
$E[\sum_i x_i - \sum_i y_i]$ with high probability. We defer the proof to the
appendix.

Lemma 2. Suppose $\{x_i\}$ and $\{y_i\}$ are sequences of Bernoulli random variables
such that $E[\sum_i x_i] = \alpha$ and $E[\sum_i y_i] = \beta$, where $\alpha < \beta$. Then

$\Pr[\sum_i x_i > \sum_i y_i] < 2\exp$



4.2 Weight Concentration Results For Heterogeneous Balls and 
Bins 

Consider a sample $S = \{s_1, s_2, \dots, s_s\}$ of size s from a fixed distribution P
over the domain [n]. Let the sample configuration be $\{c_1, \dots, c_n\}$, where $c_i$
is the number of times element i was selected. A standard interpretation is
that the sample represents s balls being dropped into n bins, where P describes
the probability of a ball going into each bin. The sample configuration is the
final distribution of balls in the bins. Note that $\sum_i c_i = s$. We show tight
concentration for the quantity $\sum_i \alpha_i c_i$, for non-negative $\alpha_i$.
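To make the objects concrete, the following sketch computes a sample configuration and the weighted sum studied in this section. The helper names `configuration` and `weight` are hypothetical, and elements are assumed to be labelled 0, ..., n-1.

```python
from collections import Counter


def configuration(sample, n):
    """Return [c_0, ..., c_{n-1}], where c_i counts occurrences of i."""
    counts = Counter(sample)
    return [counts.get(i, 0) for i in range(n)]


def weight(config, alpha):
    """Return the weighted sum of counts, sum_i alpha_i * c_i."""
    return sum(a * c for a, c in zip(alpha, config))
```

The counts always add up to the sample size, which is the source of the dependence between the terms of the weighted sum.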

Note that $c_i$ and $c_j$ are correlated, and therefore the terms in the sum are
not independent. An immediate approach would be to consider the contribution
of the ith sample to the sum. It is easy to see that the contribution is
bounded by $\max_i \alpha_i$, and thus one can apply McDiarmid's inequality [5];
however, the resulting bound would be too weak since we have to apply a
uniform upper bound.

In what follows we call the sampling procedure in which each sample is selected
independently from the distribution over the domain [n] type I sampling. Now
we consider a different kind of sampling that we call type II.

Definition 2 (Type II sampling). For each i in [n] we flip a $p_i$-biased coin s
times, and select the corresponding element every time we get heads.
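The two sampling procedures can be contrasted in a few lines. This is a minimal sketch assuming the distribution is an explicit probability vector; the function names are hypothetical.

```python
import random


def type_one_configuration(p, s, rng):
    """Type I: draw s independent samples from p and count them.

    The counts always sum to exactly s.
    """
    config = [0] * len(p)
    for i in rng.choices(range(len(p)), weights=p, k=s):
        config[i] += 1
    return config


def type_two_configuration(p, s, rng):
    """Type II: for each element i, flip a p_i-biased coin s times.

    Each count is an independent Binomial(s, p_i) variable, so the
    total number of selected elements is random (s in expectation).
    """
    return [sum(1 for _ in range(s) if rng.random() < pi) for pi in p]
```

The counts in type II sampling are independent across elements, which is what makes concentration bounds easy to derive there before transferring them to type I.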

We show that for almost any sample selection of type I, the corresponding
configuration has similar weight in type II sampling. The weight of all
configurations in type I not satisfying this constraint is o(1/s). Once we
show that, any concentration bound for type II sampling will translate to a
corresponding concentration bound for type I.

We use $P^{(I)}[\cdot]$ and $P^{(II)}[\cdot]$ to denote probability according to type I and
type II sampling. Due to space limitations all the proofs are deferred to the
appendix. Now we show the lower and upper bound connections between type I
and type II sampling.

Lemma 3. For every configuration $C = \langle i_1, \dots, i_n \rangle$ such that $i_j < \ln s$ and
$\sum_j i_j = s'$, where $s' \in [s \pm \sqrt{s}]$,

-yip(") [C] < P^''> [C] < 30s3/2p(^^) 
3 

where $P^{(I)}[C]$ is the probability of observing C in type I sampling with s'
elements and $P^{(II)}[C]$ is the probability of observing C in type II sampling
with s elements.

The following lemma bounds the total probability of configurations in which
some element appears more than ln s times. The proof is deferred to the
appendix.

Lemma 4. In type I sampling, the probability there exists an element that was 
sampled more than In s times is at most i " " * ^ 



^ In fact the technique applies to arbitrary bounded functions. 



Now we formalize the translation between concentration bounds for type I
and type II sampling.

Lemma 5. Suppose we sample from distribution P s times using type I and
type II sampling, resulting in configurations C. Let $A = \{\alpha_i\}$ be an arbitrary
vector with non-negative elements, and r > 0. Then

pii) [\w - E [M^] I > r] < 30s3/2p(") _ g j^/j | > 1 
where $W = \sum_{i=1}^{n} \alpha_i c_i$.

Lemma 6. Consider s samples selected using type I sampling from the
distribution $P = \{p_i\}$, where s > 10, and let $A = \{\alpha_i\}$ be an arbitrary vector.
Let $W = \sum_i \alpha_i c_i$. Then



oln In s ' 



Pr 



|I^-E[iy] I > 2(lns)3/2 



1 

< — 

- s2 



5 Algorithm and Analysis 

At a high level our algorithm works as follows. First, we check if the 2-norms
of distributions P and Q are sufficiently far away from each other. If this is
the case, then we can decide simply by looking at the estimates of the 2-norms
of P, Q and T. On the other hand, if $\|P\|_2 \approx \|Q\|_2$, then we show that counting
the number of collisions of a sample from T with samples from P and Q, and then
choosing the one that has the higher number of collisions, gives the correct
answer with high probability. Algorithm 2 provides a full description of
2-norm estimation. The idea is to estimate the probability mass of a sample S
by computing the number of collisions of fresh samples with S, and then noting
that the expected mass of a sample of size l is $l\|P\|_2^2$. One important caveat
is that if S contains a particular element more than once, we need to count
collisions carefully so that the expected number of collisions stays equal to
the mass of S. To achieve that, we split our sampling into $\max_i c_i$ phases.
During phase i we only count collisions with elements that have occurred at
least i times. For more details we refer to Algorithm 1, which is used as a
subroutine for both Algorithm 2 and the main Algorithm 3. For the analysis we
mainly use the technique developed in Section 4.



Lemma 7. Algorithm 1 outputs a set S such that $E[|S|] = \sum_{i=1}^{n} c_i p_i$.

Proof. Let $w_i$ be the contribution of $s_i$. Then $E[w_i] = \sum_{j} p_j 1_{c_j \ge i}$.
Summing over all elements of S and using linearity of expectation we have:
$E[|S|] = \sum_{i=1}^{m} E[w_i] = \sum_{i=1}^{m} \sum_{j=1}^{n} p_j 1_{c_j \ge i} = \sum_{i=1}^{n} p_i c_i$,
where the last equality follows from changing the summation order. ■



Lemma 8. The total number of elements selected after l iterations in step 3
of Algorithm 2 is a sum of independent Bernoulli random variables, and its
expectation is lW.



Algorithm 1 Sampling according to given pattern

Input: Configuration $\{c_1, \dots, c_n\}$, where $m \ge c_i \ge 0$. Distribution P

Output: Multi-set of elements S, such that $E[|S|] = \sum_{i=1}^{n} c_i p_i$

Description:

1. Sample m elements from P, $s_1, \dots, s_m$

2. For each $s_i$, if $c_{s_i} \ge i$ include $s_i$ into set S

Algorithm 2 Computing 2-norm of distribution P
Input: Distribution P, accuracy parameter l
Output: Approximate value of $\|P\|_2^2$
Description:

1. Select l samples from P, let $c_1, \dots, c_n$ be the configuration. Note that the
expected weight W of the configuration is $l\|P\|_2^2$

2. If $\max_i c_i > \log l$ report failure

3. Sample with repetition using Algorithm 1 l times. Let $r_i$ be the respective
set size from the ith simulation. Note that $E[r_i] = W$

4. Report $\sum_i r_i / l^2$ as the approximate value
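The four steps above can be sketched as follows. This is an illustration rather than a definitive implementation: the distribution is an explicit probability vector, the logarithm in step 2 is assumed to be natural, the reported value is assumed to be the total number of selections divided by l squared, and the name `estimate_squared_two_norm` is hypothetical.

```python
import math
import random


def estimate_squared_two_norm(p, l, rng):
    """Sketch of Algorithm 2: estimate sum_i p_i^2 from samples.

    Returns None when step 2 reports failure.
    """
    n = len(p)
    # Step 1: draw l samples and record the configuration.
    config = [0] * n
    for x in rng.choices(range(n), weights=p, k=l):
        config[x] += 1
    # Step 2: fail if some element was seen too often.
    m = max(config)
    if m > math.log(l):  # assumption: natural logarithm
        return None
    # Step 3: run the pattern sampler l times on fresh samples.
    total = 0
    for _ in range(l):
        draws = rng.choices(range(n), weights=p, k=m)
        total += sum(
            1 for phase, x in enumerate(draws, start=1) if config[x] >= phase
        )
    # Step 4: the expected total is l * W, and E[W] = l * sum_i p_i^2.
    return total / (l * l)
```

The failure branch is what keeps each run of the pattern sampler short, which is the basis of the sample bound stated for this algorithm.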



Proof. Indeed, the expected number of selections for every invocation of
Algorithm 1 is W, and that in itself is a sum of $m < \log s$ Bernoulli variables.
Thus, we have $l \log s$ Bernoulli random variables with the total expected weight
of lW.



Furthermore, the total number of samples used by the algorithm is bounded.

Lemma 9. The total number of samples used by Algorithm 2 is at most $2l \log l$.

Now we are ready to prove the main property of Algorithm 2.

Lemma 10 (Concentration results for 2-norm estimation). Suppose
Algorithm 2 is run for distributions P and T with parameter l > 10. If P = T
then the estimate for $\|P\|_2$ is greater than the estimate for $\|T\|_2$ with
probability 1/2. If

$l(\|T\|_2^2 - \|P\|_2^2) > 4(\ln l)^{3/2}(\|P\|_2 + \|T\|_2)$

then the estimate for $\|P\|_2$ is smaller than the estimate for $\|T\|_2$ with
probability at least $1 - c/l^2$, where c is some universal constant.

Proof. 



The first part is due to symmetry. For the second part, we use Lemma 6 to
note that the weight of selection satisfies
$W_T < l\|T\|_2^2 - 3(\ln l)^{3/2}\|T\|_2$ with probability at most $1/l^2$, or
$W_P > l\|P\|_2^2 + 3(\ln l)^{3/2}\|P\|_2$ with probability at most $1/l^2$. Therefore
with probability at least $1 - 2/l^2$ we have
$|W_T - W_P| > (\ln l)^{3/2}(\|P\|_2 + \|T\|_2)$. Using Lemmas 8 and 2 we have:



Wt < Wp 



<o(-i)+2exp 



P{Wt - Wpf 



%{Wt + Wp) 



<^+2cxp 



\\Q\[. 



^2^ 



8/2 
Bringing it all together: main algorithm and analysis

Algorithm 3 Distinguishing between two distributions

Input: Blackbox providing samples from P, Q and T. Estimate $s = \frac{\|P+Q\|_2}{\|P-Q\|_2^2}$
Output: "P" if T is P and "Q" otherwise
Description:

1. Compute the 2-norm of P, Q and T using Algorithm 2 with accuracy parameter
$l = 30 s (\ln s)^{3/2}$. Repeat log s times. Let $P_i$, $Q_i$ and $T_i$ denote the
estimated norms in the ith iteration.

2. If $T_i > P_i$ for all i or $T_i < P_i$ for all i then report "Q" and quit.

3. If $T_i > Q_i$ for all i or $T_i < Q_i$ for all i then report "P" and quit.

4. Else:

(a) Training Phase: sample $l = 30 (\ln s)^{3/2} s$ elements from distributions P
and Q. Let $C_P$ and $C_Q$ denote the configurations of elements that were
selected.

(b) Testing Phase: use Algorithm 1 to sample with repetition s times, on both
$C_P$ and $C_Q$, using a fresh sample each time. Let $c_P$ and $c_Q$ denote the total
size of the Algorithm 1 output for $C_P$ and $C_Q$ respectively.

(c) If $c_P > c_Q$ report "P", otherwise report "Q".
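Step 4 of Algorithm 3 can be sketched in isolation. The code below is a simplified illustration, not the full algorithm: it omits the norm-comparison stage (steps 1 to 3), treats P, Q and T as explicit probability vectors, takes s and l as plain arguments, and uses hypothetical names throughout.

```python
import random


def _configuration(dist, size, rng):
    """Counts of each element among `size` samples from `dist`."""
    config = [0] * len(dist)
    for x in rng.choices(range(len(dist)), weights=dist, k=size):
        config[x] += 1
    return config


def _hits(config, test_dist, rng):
    """One run of the pattern sampler against fresh samples from T."""
    m = max(config)
    draws = rng.choices(range(len(test_dist)), weights=test_dist, k=m)
    return sum(1 for phase, x in enumerate(draws, start=1) if config[x] >= phase)


def collision_stage(p, q, t, s, l, rng):
    """Sketch of step 4: compare collision counts of T with P and with Q."""
    cfg_p = _configuration(p, l, rng)  # (a) training phase
    cfg_q = _configuration(q, l, rng)
    c_p = sum(_hits(cfg_p, t, rng) for _ in range(s))  # (b) testing phase
    c_q = sum(_hits(cfg_q, t, rng) for _ in range(s))
    return "P" if c_p > c_q else "Q"  # (c) decision, ties go to "Q"
```

With fully disjoint supports every test sample collides with exactly one training configuration, so the decision is deterministic; the interesting regime is the one where the supports overlap heavily.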



where we have used $\exp(-\ln^2 l) < 1/l^2$ for $l > 10$. ■

Let $W_P$ and $W_Q$ denote the total probability mass, in distribution T, of the
samples selected from P and Q. In other words $W_P = \sum_i t_{s_P(i)}$, where $s_P(i)$
is the i-th sample from P and $t_x$ is the probability of x under T.

Now consider hypothesis $H_1$ (i.e., T = P). We have $E[W_P] = s\sum_{i=1}^{n} p_i^2$ and
$E[W_Q] = s\sum_{i=1}^{n} p_i q_i$, where the i-th term is the expected contribution of
the i-th element of the distribution to the total sum for a single selection.
Therefore we have

$E[W_P - W_Q] = s\sum_{i=1}^{n} p_i(p_i - q_i)$

Similarly in hypothesis $H_2$ we have:

$E[W_Q - W_P] = s\sum_{i=1}^{n} q_i(q_i - p_i)$
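As a worked step (an elementary identity written out here for convenience, not taken verbatim from the text), the expectation gap under the first hypothesis splits into a distance term and a norm-difference term. This is why the analysis first handles the case where the two norms differ substantially and then treats the remaining case through collisions.

```latex
\begin{align*}
\sum_{i=1}^{n} p_i (p_i - q_i)
  &= \frac{1}{2} \sum_{i=1}^{n} (p_i - q_i)^2
   + \frac{1}{2} \sum_{i=1}^{n} \left(p_i^2 - q_i^2\right) \\
  &= \frac{1}{2} \|P - Q\|_2^2
   + \frac{1}{2} \left( \|P\|_2^2 - \|Q\|_2^2 \right).
\end{align*}
```

Expanding the right-hand side term by term gives $\frac{1}{2}(p_i^2 - 2p_iq_i + q_i^2) + \frac{1}{2}(p_i^2 - q_i^2) = p_i^2 - p_i q_i$, which confirms the identity.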

The rest of the analysis proceeds as follows. We first show that if the
algorithm passes the first stage then with high probability
$|\|P\|_2^2 - \|Q\|_2^2| \le 4(\ln s)^{3/2}\|P + Q\|_2$, in which case the majority voting
on collision counts gives the desired result. The following lemma is almost
immediate from Lemma 10.

Lemma 11 (Correctness of the case when $\|P\|_2 \ne \|Q\|_2$). If

n 

s[Y,{pI - q^,)[>\n''\l)m\2 + \\Q\\2) 

i=l 



the algorithm terminates on or before step of algorithm with probability at 
least 1 — c/s^. Further, the probability of incorrect answer if it terminates is at 
most , for some constant c. 



Proof. Without loss of generality we can assume T = P. Using the result of 
lemma [TUl the probability that P^ < or for all i is at most < 
Thus probability of incorrect answer is at most p- . 

Second, if the condition of the lemma satisfied, then from lemma [TUl and the 
union bound the probability of inconsistent measurements is at most < 



1 



Given the lemma above, we have that if the algorithm passed stage 1, then
$-4(\ln l)^{3/2}(\|P\|_2 + \|Q\|_2) < l\left(\sum_i p_i^2 - \sum_i q_i^2\right) \le 4(\ln l)^{3/2}(\|P\|_2 + \|Q\|_2)$

Therefore in hypothesis Hi: 

■\\P-Q\\l 4(lnZ)3/2 



E [Wp - Wq] = liJ2P^-J2 ^ 



-i\\P\\2 + \\Q\\2) 



We have In^ = ln(sln"^^^s) < Inlns + logs + 3 < 1.5 logs, and thus I > 
20{\nlf/^ \\p^^\k . Substituting we have: 



ll^'-Qli: 
E [Wp -Wq]>1 



>6(lnZ)3/2(||P||2 + ||Q||2) 



Applying Lemma [5] with weight function P, the probability that either Wp and 
Wq deviates from its expectation by more than 2(lnZ)^/^||P||2 is at most for 
a fixed constant c. Thus 



Pr 



W^p-W^Q<(lnO'/'(||-P||2 + ||g||2) 



< 



C2 



(1) 



Therefore $W_P$ and $W_Q$ are far apart with high probability, and can be
distinguished using the bounds from Lemma 2. Indeed, the total expected number
of hits is $sW_P$ and $sW_Q$ for P and Q respectively, thus the probability that
the total number of hits for $C_P$ is smaller than for $C_Q$ in $H_1$ is at most:



r siWp-WQ^, 

Pr[c,<c,]<exp[- '^^^^l' ]<e.p 



sln3/||P + g||^ 



l^ll^ 



< 



(2) 



Combining ([T]) and ^ we have that for some universal constant c, with proba- 
bility at least 1 — p-, Cp will receive more hits than Cq and symmetrically in 
H2, Cq will receive more hits than Cp. Thus the algorithm produces correct 
answer with high probability and the proof of Theorem [T] is immediate. 



6 Lower Bounds for Weakly Disjoint Distributions 

In this section we prove Theorem 2. First we observe that for any fixed pair of
distributions, there is in fact an algorithm that can differentiate between
them with far fewer samples than our lower-bound theorem dictates. Thus, the
main challenge is to prove that there is no universal algorithm that can
differentiate between arbitrary pairs of distributions. Here, we show that even
for the simpler problem where the pair of distributions is fixed and a random
permutation $\pi$ is applied to both of them, there is no algorithm that can
differentiate between $\pi P$ and $\pi Q$. Since this problem is simpler than the
original problem (we know the distribution shape), the lower bound applies to
the original problem.

Problem 3 (Differentiating between two known distributions with unknown
permutation). Suppose P and Q are two known distributions on domain D. Solve
the distinguishability problem on the class of distributions defined by
$(\pi P, \pi Q)$, for all permutations $\pi$.

In Problem 3 the algorithm needs to solve the problem for every $\pi$; thus if
such an algorithm exists, it would be able to solve the distinguishability
problem with $\pi$ chosen at random. From the perspective of the algorithm,
elements that were chosen the same number of times are equivalent. Thus, the
only factors the algorithm can differentiate upon are the counts of how often
each element appeared in the different phases. More specifically, we use the
notation $|(i, j, k)|$ to denote the number of elements that were chosen i times
while sampling from P and j times while sampling from Q in the training phase,
and k times during the testing phase. We also use the notation $|(i, j, *)|$ to
denote the total number of elements that were selected i and j times during
the training phase and an arbitrary number of times during the testing phase.
Finally, we use $|(i, j, +)|$ to denote the number of such elements that were
selected at least once during the testing phase. In what follows we use $H_1$
and $H_2$ to denote the two possible hypotheses that the testing distribution is
in fact P or Q respectively.
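The signature counts are straightforward to compute from the three samples. The sketch below is illustrative; the function names are hypothetical and elements are assumed to be labelled 0, ..., n-1.

```python
from collections import Counter


def signature_counts(train_p, train_q, test, n):
    """Map each signature (i, j, k) to the number of elements having it.

    i, j: occurrences in the training samples from P and Q;
    k: occurrences in the testing sample.
    """
    cp, cq, ct = Counter(train_p), Counter(train_q), Counter(test)
    return Counter((cp[x], cq[x], ct[x]) for x in range(n))


def count_star(sig, i, j):
    """|(i, j, *)|: elements with training counts (i, j), any test count."""
    return sum(v for (a, b, _), v in sig.items() if (a, b) == (i, j))


def count_plus(sig, i, j):
    """|(i, j, +)|: as above, but seen at least once in the testing phase."""
    return sum(v for (a, b, k), v in sig.items() if (a, b) == (i, j) and k >= 1)
```

Because a random permutation hides element identities, these counts summarize everything an algorithm can use.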

To prove this theorem we simplify the problem further by disclosing some
additional information about the distributions; this allows us to show that
some data in the input carries no information and can be ignored.
Specifically, we rely on the following variant of the problem:

Problem 4 (Differentiating with a hint). Suppose P and Q are two known
distributions on D and $\pi$ is an unknown permutation on D. For each element
that satisfies one of the following conditions, the algorithm is told whether
it belongs to the common or the disjoint set: (a) it was selected at least once
while sampling from T and at least twice while sampling from P or Q; (b) it was
selected at least twice while sampling from P or Q, and belongs to
$\mathcal{C}(\pi P, \pi Q)$. The set of all elements for which their probabilities are known
is called the hint. Given the hint, differentiate between $\pi P$ and $\pi Q$.

^ For instance, for distinguishing between two uniform distributions, each on
half of the domain, a constant number of samples is sufficient.



Note that Problem 3 is immediately reducible to Problem 4, thus a lower bound
for Problem 4 immediately implies a lower bound for Problem 3. If an element
from the disjoint part of P and Q has its identity revealed, then the algorithm
can immediately output the correct answer. We call such elements helpful. We
call other revealed elements unhelpful; note that the set of unhelpful elements
is fully determined by the training phase (these are the elements that were
selected at least twice during the training phase and belong to the common
set). First, we prove the bound on the probability of observing helpful
elements. Later we show that knowledge of unhelpful elements does not reveal
much to the algorithm.

Lemma 12. // the number of samples s < jfpf^^' then the probability that 
there is one or more helpful elements is at most -tj. 

Proof. For every element that has probability of selection during testing phase 
p, the probability that it becomes helpful is at most (^3*)^'^ < ^"g^ < l.bs'^p^ 
if it belongs to disjoint section, and otherwise. Therefore, the total expected 
number of helpful elements is LSs'^HP — QWs- Using Markov inequality we im- 
mediately have that probability of observing one or more helpful element is at 
™°st i.5,3||p_Q||. < M as desired. ■ 

Since the probability of observing helpful element is bounded by ^ , it suffices 
to prove that any algorithm that does not observe any helpful elements, is correct 
with probability at most 1 — The next step is to show that disclosed 

elements from the common part are ignored by the optimal algorithm. 

Lemma 13. The optimal algorithm does not depend on the set of unhelpful 
elements. 

Proof. More formally, let C denote the testing configuration, and let Y denote
the unhelpful part of the hint. Let A(C, Y) be the optimal algorithm, which
takes C and Y as input and outputs $H_1$ or $H_2$. Suppose there exist Y' and Y''
such that $A(C, Y') \ne A(C, Y'')$, and without loss of generality assume
$A(C, Y') = H_1$ and $A(C, Y'') = H_2$. Of all optimal algorithms we choose the one
that minimizes the number of triples (C, Y', Y'') that satisfy this property.
Without loss of generality $\Pr[C|H_1] \ge \Pr[C|H_2]$, and let $A_1$ be the
modification of A such that $A_1(C, Y'') = H_1$. But then the total probability of
error for $A_1$ will be lower by $\Pr[C|H_1]\Pr[Y''] - \Pr[C|H_2]\Pr[Y'']$, and hence
smaller than or equal to that of A, a contradiction with either optimality or
minimality of A. ■

So far we showed that helpful elements terminate the algorithm, and non- 
helpful do not affect the outcome. Therefore, the only signatures (i, j, k) that the 
algorithm has knowledge of will belong to the following set: 

{(0,0,0), (0,0,1), (0,1,0), (1,0,0), (0,1,1), (1,0,1), (0,0, 2), (2, 0,0), (0,2,0)}. 

Consider the following random variables: $|(0, 1, *)|$, $|(1, 0, *)|$, $|(0, 0, *)|$,
$|(0, 2, *)|$, $|(2, 0, *)|$. They are fully determined by the training phase and
thus are independent of the hypotheses. We call these random variables the
training configuration. The following lemma is immediate and the proof is
deferred to the appendix.



Lemma 14. If the training configuration is fixed then the values $|(0, 1, 1)|$,
$|(1, 0, 1)|$, $|(0, 0, 1)|$ fully determine all the data available to the algorithm.

Therefore, for a fixed configuration of the training phase, the output of the
optimal algorithm only depends on $|(0,1,1)|$, $|(1,0,1)|$, $|(0,0,1)|$. Consider
h(i, j), the total number of elements that were selected during the testing
phase and that have signature (i, j, *). Note that $h(i, j) = \sum_{k \ge 1} k \, |(i, j, k)|$.

Now we prove that no algorithm by observing h{0,l), h{l,0), h{0,0) can 
have error probability less than 1/17. Again we defer the proof to the appendix 
due to space constraints. 

Lemma 15. Let $C_P, D_P$ and $C_Q, D_Q$ be the probability masses of the subsets of
$\mathcal{C}(\pi P, \pi Q)$ and $\mathcal{D}(\pi P, \pi Q)$ that were sampled during the training phase of
P and Q respectively. Assume without loss of generality $\Pr[D_P > D_Q] > 1/2$.
Then with probability at least 1/2, in hypothesis $H_1$ for observed h(1,0) = a,
h(0,1) = b, h(0,0) = c, we have:

Pr[fe(l,0)^a,/t(0,l)^6,/t(0,0)^c|gi] ^ 

Pr [ /i(l, 0) = a, h{0, 1) = 6, h{0, 0) = c\H2] ~ ^ ' 

From this lemma it immediately follows that any algorithm will be wrong with 
probability at least 1/16. Now we are ready to prove the main theorem of this 
section. 

Proof of Theorem 2. By Lemmas 13 and 14 the optimal algorithm can ignore
all the elements that have signatures other than (0, 1, 1), (1, 0, 1) and
(0, 0, 1). Now, suppose there exists an algorithm A(x, y, z) that only observes
$|(0,1,1)|$, $|(1,0,1)|$, $|(0,0,1)|$, and has error probability less than 1/100.
Then the algorithm for $|(0,1,+)|$, $|(1,0,+)|$ and $|(0,0,+)|$ that just
substitutes the latter into the former,
will have error probability at most 1/100 + ^ < 1/17. Indeed, let x — |(0, 1, +)|, 
y — 1(1,0, +)| and z ~ |(0,0, +)| — 2(s — x — y) and execute the algorithm 
A{x,y,z) will have error at most 1/100+ Thus, we contradicted Lemma [T51 



7 Conclusions and Open Directions 

Perhaps the most interesting immediate extension of our work is to incorporate
high-frequency elements into our analysis to eliminate the technical
assumptions that we make in our theorems. One possible approach is to combine
techniques of Valiant and Micali, but it remains to be seen if the hybrid
approach will produce better results. Proving or disproving our conjecture
that weakly disjoint distributions are indeed the hardest when it comes to
telling distributions apart would also be an important technical result.

The other direction is to extend the techniques of Section 4.2. For instance,
these techniques could be used to estimate various concentration bounds on how
many heterogeneous bins will receive exactly t balls, and it remains an
interesting question how far those techniques could be pushed. On the more
technical side, an interesting question is whether (under some reasonable
assumption) the probability ratio between type I and type II configurations
is in fact bounded by a constant, rather than by O(s). By using tighter
analysis we can show that this ratio is in fact $O(\sqrt{s})$, though reducing it
further remains an open question.
A Equivalence of identity and distinguishability problems

Proof of Lemma 1. We first show how to simulate a correct algorithm $A_I$ for
the identity problem using an algorithm $A_D$ for the distinguishability problem.
We run $A_D$ 5 log s times on fresh input, where the sample for the testing
phase is always taken from the first distribution. If the answer is always the
same, say "different"; otherwise say "same". If the two distributions are the
same, then $A_D$ gives a random answer, therefore the probability for $A_D$ to
produce the same answer for log s iterations (and hence for $A_I$ to produce the
wrong answer) is
at most (3/5)^'°8s < Similarly if distributions are different, the probability 
for Av go give different answer means that it has mistaken at least once, which 
would happen with probability at most '°g(3^)p^°'y'o3(s) ^Jesired as it remains 
polylograthmic. 
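The first reduction can be sketched as a thin wrapper around a distinguisher. This is a hedged illustration: `run_distinguisher` is a hypothetical zero-argument callable that runs the distinguishing algorithm once on fresh samples (with the testing sample always taken from the first distribution) and returns its answer, and `rounds` stands for the number of repetitions.

```python
def identity_from_distinguisher(run_distinguisher, rounds):
    """Sketch of the reduction from identity testing to distinguishing.

    If every run returns the same answer, the distinguisher is behaving
    consistently, which suggests the distributions differ; mixed answers
    suggest it is guessing, as it would on identical distributions.
    """
    answers = {run_distinguisher() for _ in range(rounds)}
    return "different" if len(answers) == 1 else "same"
```

The number of rounds trades samples for error: more rounds make it less likely that a randomly guessing distinguisher happens to repeat the same answer every time.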

For the other direction we simulate a correct algorithm for the
distinguishability problem using an algorithm $A_I$ for the identity problem. We
test a new sample against both X and Y. If $A_I$ says that both are the same or
both are different, output X with probability 0.5. If the output of $A_I$ is
"the same as X", output "X"; otherwise output "Y".

Let $\delta$ denote the error probability of $A_I$, which has the form
$\mathrm{polylog}(s)$ divided by a power of $s$. If $X$ and $Y$ are the same, the
testing algorithm will say "the same" for both with probability at least $1 - 2\delta$,
thus the distinguishing algorithm will say "X" with probability at least
$0.5 - \delta$, and "Y" with probability at least $0.5 - \delta$. If $X$ and $Y$ are
different, then the testing algorithm will make a mistake with probability at most
$2\delta$; otherwise the identity algorithm will produce different answers for the two
tests, and the distinguishing algorithm will be correct with probability at least
$1 - 2\delta$. ■
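
The two simulations in this proof can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the callable interfaces `distinguish` and `same`, the string labels, and the `rounds` parameter (standing in for the logarithmic number of repetitions) are all hypothetical.

```python
import random


def identity_from_distinguisher(distinguish, draw_first, draw_second, s, rounds):
    """Build an identity test from a distinguisher (first direction).

    distinguish(x, y, t) is an assumed black box that returns a label saying
    which of the two samples x, y the test sample t resembles.
    """
    answers = set()
    for _ in range(rounds):
        # Fresh input on every round.
        x = [draw_first() for _ in range(s)]
        y = [draw_second() for _ in range(s)]
        # The test sample is always taken from the first distribution.
        t = [draw_first() for _ in range(s)]
        answers.add(distinguish(x, y, t))
    # A consistent answer suggests the distributions differ; a mix of
    # answers suggests the distinguisher is guessing.
    return "different" if len(answers) == 1 else "same"


def distinguisher_from_identity(same, x, y, t, rng):
    """Build a distinguisher from an identity test (second direction).

    same(a, b) is an assumed black box returning True when the identity
    test declares samples a and b to come from the same distribution.
    """
    like_x = same(t, x)
    like_y = same(t, y)
    if like_x == like_y:
        # Both "same" or both "different": fall back to a fair coin.
        return "X" if rng.random() < 0.5 else "Y"
    return "X" if like_x else "Y"
```

Passing the black boxes as callables keeps the sketch independent of any particular tester, so a toy tester can be swapped in to experiment with the reduction.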



B Proofs From Subsection 4.1



Lemma 16. Let $\{x_i\}$ and $\{y_i\}$ be sequences of $N$ independent, though not
necessarily identical, Bernoulli random variables such that $E[\sum_i x_i] = pN$,
$E[\sum_i y_i] = qN$ and $pN < qN$. If $N > [\ldots]$, then $[\ldots]$.

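Proof. Using the Chernoff inequality, for any $\alpha > 0$ we have

$$\Pr\Big[\sum_{i=1}^{N} x_i > (1+\alpha)pN\Big] < \frac{\exp[\alpha pN]}{(1+\alpha)^{(1+\alpha)pN}} \qquad (4)$$

On the other hand, for each $\beta$ we have:

$$\Pr\Big[\sum_{i=1}^{N} y_i < (1-\beta)qN\Big] < \exp\big[-Nq\beta^2/2\big] \qquad (5)$$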



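Choose $1 - \beta = \frac{p+q}{2q}$ and $1 + \alpha = \frac{p+q}{2p}$, so that both thresholds equal $(p+q)N/2$. Since $p < q$, we have $\beta, \alpha > 0$.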

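First consider the case where $p < q/8$. Substituting into (4) we get:

$$\Pr\Big[\sum_i x_i > \frac{p+q}{2}N\Big] < \frac{\exp[N(q-p)/2]}{\left(\frac{p+q}{2p}\right)^{(p+q)N/2}} < \exp\Big[N\Big(\frac{q-p}{2} - \frac{3(p+q)}{4}\Big)\Big] < \exp\Big[-N\frac{p+q}{4}\Big] < \exp\Big[-N\frac{(q-p)^2}{4(p+q)}\Big]$$

where we have used $(p+q)/(2p) > 4.5 > e^{3/2}$. Similarly for (5) we have: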



$$\Pr\Big[\sum_i y_i < (1-\beta)qN\Big] < \exp\Big[-\frac{N(q-p)^2}{8q}\Big] < \exp\Big[-\frac{N(q-p)^2}{8(p+q)}\Big] \qquad (6)$$



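If, on the other hand, $q > p > q/8$, then $\alpha = (p+q)/(2p) - 1 < 4$, and we can use
the following variant of the Chernoff bound:

$$\Pr\Big[\sum_i x_i > \frac{p+q}{2}N\Big] < \exp\big[-pN\alpha^2/4\big] = \exp\Big[-\frac{N(q-p)^2}{16p}\Big] < \exp\Big[-\frac{N(q-p)^2}{8(p+q)}\Big] \qquad (7)$$

and similarly:

$$\Pr\Big[\sum_i y_i < \frac{p+q}{2}N\Big] < \exp\big[-qN\beta^2/2\big] = \exp\Big[-\frac{N(q-p)^2}{8q}\Big] < \exp\Big[-\frac{N(q-p)^2}{8(p+q)}\Big]$$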



Combining, we have the desired result. ■

Lemma 2 immediately follows from Lemma 16.
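
As a numeric sanity check, the two Chernoff tail estimates can be evaluated for concrete parameters with the thresholds set by $1+\alpha = (p+q)/(2p)$ and $1-\beta = (p+q)/(2q)$. The sketch below is only an illustration; the function name and the choice to return logarithms of the bounds are assumptions made here, not part of the paper.

```python
import math


def chernoff_log_bounds(p, q, N):
    """Return (log upper-tail bound, log lower-tail bound, common threshold).

    Upper tail:  Pr[sum x_i > (1+alpha) p N] < e^{alpha p N} / (1+alpha)^{(1+alpha) p N}
    Lower tail:  Pr[sum y_i < (1-beta) q N]  < exp(-N q beta^2 / 2)
    with 1+alpha = (p+q)/(2p) and 1-beta = (p+q)/(2q).
    """
    if not 0 < p < q <= 1:
        raise ValueError("need 0 < p < q <= 1")
    alpha = (p + q) / (2 * p) - 1
    beta = 1 - (p + q) / (2 * q)
    # Both tails are measured against the same midpoint threshold.
    threshold = (p + q) * N / 2
    log_upper = alpha * p * N - (1 + alpha) * p * N * math.log(1 + alpha)
    log_lower = -N * q * beta ** 2 / 2
    return log_upper, log_lower, threshold


if __name__ == "__main__":
    up, low, thr = chernoff_log_bounds(0.01, 0.1, 1000)
    print("threshold:", thr)
    print("log Pr[sum x_i above threshold] <", up)
    print("log Pr[sum y_i below threshold] <", low)
```

Working with logarithms avoids floating-point underflow when $N$ is large.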

C Proofs for concentration bounds for balls and bins 

Proof of Lemma 3. Note that type II sampling is over $s$ elements whereas
type I is over $s'$ elements. We have,



zi!i2! ...inl 



whereas 

p(") [c] = n 



Recalling that ij < Ins, we have s'^ > -^i 
therefore we have: 



> (s — lns)*J , and *j — s' 



where we have used = s'^ (1 + ^^^)* and e*^* > (1 
Substituting we have: 



2s' 



> 



(8) 



P(^^) [C] ^ s' 



Pii)[C] - (2es)s'! -IJ- - 9esV? 



> — exp 



> 



1 



10e2s3/2 



where in the first transition we used the fact that $(1 - x/(s+1))^s > e^{-x}$ for
$x < 1$ and $s > 1$, the fact that $p_j(s+1) < 1$, and finally Stirling's formula
$s'! < 3\sqrt{s'}\,(s'/e)^{s'}$, as desired. To get the lower bound we have:

^ n(i - p^y-' ^ ^ n(i - v-s)- d - p^y m 



3e^ 
< — ;= exp 
- 2Vi 



< 



2^ 



(10) 



where in the second transition we used $n! > 2\sqrt{n}\,(n/e)^n$ and $p_j < 1/s$, in the
third we have used $(1 - 1/s)^{-i_j} < 3$ and $(1 - x/s)^s < e^{-x}$. ■



Proof of Lemma 4. Indeed, let $d_i$ denote the total number of times that sample
$S_i$ was repeated in $S$. Recall that we have $p_i s < 1/2$ for all $i$, and thus we can
just uniformly upper bound it:

$$\Pr[d_i > \ln s] \le \max_k \Pr[c_k > \ln s] \le \max_k \frac{\exp[\ln s - p_k s]}{\left(\frac{\ln s}{p_k s}\right)^{\ln s}} \qquad (11)$$

$$\le \frac{s}{s^{\ln\ln s}} \qquad (12)$$

Using the union bound over all $d_i$ gives us the desired result. ■
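
To get a feel for how quickly this tail estimate decays, one can evaluate the standard multiplicative Chernoff bound $e^{t-\mu}/(t/\mu)^{t}$ with $t = \ln s$ and $\mu = p_k s$. The sketch below is illustrative only; the function names are hypothetical, and the union bound is taken over $s$ samples as an assumption about how the final step is applied.

```python
import math


def log_repeat_bound(s, p_k):
    """Log of the Chernoff bound on Pr[c_k > ln s] for an element of mass p_k.

    Uses Pr[X > t] <= e^{t - mu} / (t / mu)^t with t = ln s and mu = p_k * s.
    """
    t = math.log(s)
    mu = p_k * s
    if not 0 < mu < t:
        raise ValueError("need 0 < p_k * s < ln s")
    return t - mu - t * math.log(t / mu)


def log_union_bound(s, p_k):
    """Log of the same bound after a union bound over s samples (assumed)."""
    return math.log(s) + log_repeat_bound(s, p_k)


if __name__ == "__main__":
    for s in (100, 1000, 10000):
        # p_k * s = 0.4 respects the condition p_k * s < 1/2.
        print(s, log_repeat_bound(s, 0.4 / s), log_union_bound(s, 0.4 / s))
```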



Proof of Lemma 5. Let us denote by $V$ the set of all configurations satisfying
$\big|\sum_{i=1}^{n} c_i\alpha_i - E[\sum_{i=1}^{n} c_i\alpha_i]\big| > r$. For any configuration $C \in V$ such that
$\max_i c_i < \ln s$, we can apply Lemma 3, and thus $P^{(I)}[C] \le 30 s^{3/2} P^{(II)}[C]$. In
addition, the total probability mass of all configurations $C$ such that $\max_i c_i > \ln s$
is small. Therefore $P^{(I)}[V]$ is at most $30 s^{3/2} \Pr[|W - E[W]| > r]$ plus this
additional mass, as desired. ■



Proof of Lemma 6. Consider type II sampling with $s$ samples and let $W'$
be the total selection weight. Note that $E[W] = E[W'] = s\|P\|_2^2$. Further, we
have $W' = \sum_{i=1}^{n} c'_i p_i$, where $c'_i$ is the count of how many times the $i$-th element
was chosen. The individual terms in the sum are independent; however, they
are not bounded, so we cannot use the Hoeffding inequality [7] directly. Instead we consider
$V' = \sum_{i=1}^{n} \min(c'_i, \ln s)\, p_i$. Using the Hoeffding inequality we have:



$$P^{(II)}\Big[|V' - E[V']| > 1.5(\ln s)^{3/2}\|P\|_2\Big] \le \exp\Big[\frac{-4.5\,\ln^3 s\,\|P\|_2^2}{(\ln s)^2 \sum_{i=1}^{n} p_i^2}\Big] = \frac{1}{s^{4.5}}$$



Furthermore, because of Lemma 4 we have:



E [V - W] < Pr [V ^= W] I \A\\^s < 



oln In s 



< 



and hence

$$\Pr\Big[|W' - E[W']| > 2(\ln s)^{3/2}\|P\|_2\Big] \le \Pr\Big[|V' - E[V']| > 1.5(\ln s)^{3/2}\|P\|_2\Big]$$

where we have used that $V' \le W'$. Thus, using the concentration Lemma 5, we have



Pr 



|W-E[V^"] I > 2(lnsf/^\\P\\2 



< Pr 



\W' --EiW]' \2{\nsf/^\\P\\2 



-.In li] 



< 1/5' 
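
The quantities in this proof are easy to simulate. The sketch below draws $s$ independent samples from a distribution, computes the selection weight $\sum_i c_i p_i$ and its truncated version $\sum_i \min(c_i, \ln s)\, p_i$, and compares the deviation from $s\|P\|_2^2$ against $2(\ln s)^{3/2}\|P\|_2$. It is an illustration under assumptions: the helper names and the example distribution are hypothetical, and sampling is modelled simply as $s$ i.i.d. draws.

```python
import math
import random
from collections import Counter


def sample_counts(probs, s, rng):
    """Draw s i.i.d. samples from probs and count occurrences per element."""
    draws = rng.choices(range(len(probs)), weights=probs, k=s)
    return Counter(draws)


def selection_weight(probs, counts):
    """Total selection weight: sum over elements of count * probability."""
    return sum(c * probs[i] for i, c in counts.items())


def truncated_weight(probs, counts, s):
    """Same sum with every count capped at ln s, so each term is bounded."""
    cap = math.log(s)
    return sum(min(c, cap) * probs[i] for i, c in counts.items())


def deviation_threshold(probs, s):
    """The deviation 2 (ln s)^{3/2} ||P||_2 used in the concentration bound."""
    l2 = math.sqrt(sum(p * p for p in probs))
    return 2 * math.log(s) ** 1.5 * l2


if __name__ == "__main__":
    n, s = 1000, 200
    total = n * (n + 1) // 2
    probs = [(i + 1) / total for i in range(n)]  # example: mass proportional to i + 1
    counts = sample_counts(probs, s, random.Random(0))
    w = selection_weight(probs, counts)
    expected = s * sum(p * p for p in probs)
    print("W =", w, "E[W] =", expected, "threshold =", deviation_threshold(probs, s))
```

Capping the counts is what makes a Hoeffding-type bound applicable, since each summand is then bounded by $p_i \ln s$.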



D Proofs for lower bounds 



Proof of Lemma 14. Indeed:

$$|(0,0,2)| = \big(s - |(1,0,1)| - |(0,1,1)| - |(0,0,1)|\big)/2$$

$$|(1,0,0)| = |(1,0,*)| - |(1,0,1)|, \qquad |(0,1,0)| = |(0,1,*)| - |(0,1,1)|$$

$$|(0,0,0)| = |(0,0,*)| - |(0,0,2)| - |(0,0,1)|, \qquad |(0,2,0)| = |(0,2,*)|, \qquad |(2,0,0)| = |(2,0,*)|$$



Proof of Lemma 15. Denote $\Pr[h(1,0) = a,\ h(0,1) = b,\ h(0,0) = c \mid H_i]$ as $\pi_i$.
We need to bound $\pi_1/\pi_2$. We have $\pi_1 = (C_P + D_P)^a\, C_Q^b\, (1 - C_P - C_Q - D_P)^c$
and $\pi_2 = C_P^a\, (C_Q + D_Q)^b\, (1 - C_P - C_Q - D_Q)^c$, therefore



<2exp[— a--^5-- c] (14) 

Cp Oq i — Op — Oq — Up 

(15) 

Now, we replace $a = a_0 + \delta_a$, $b = b_0 + \delta_b$ and $c = c_0 + \delta_c$, where $a_0 =
(C_P + D_P)s$, $b_0 = C_Q s$, and $c_0 = (1 - C_P - C_Q - D_P)s$, and we get:

^ < 2exp[#.]exp[$^<5.]exp[^4]exp[ ^^^^l . 1 



7r2 Cp Cp Cq 1 — Cp — Cq — Dp 

Note that $E[\delta_a] = E[\delta_b] = E[\delta_c] = 0$, and with probability at least 0.5, $\delta_a <
3\sqrt{(C_P + D_P)s}$, $\delta_b < 3\sqrt{C_Q s}$ and $\delta_c < 3\sqrt{(1 - C_P - C_Q - D_P)s}$. Substituting,
we have with high probability:

II < 2„p,a.|.xp[g£^]„p[gs#i„p, „/i^;-'l°i^„ ,1 (16) 

7r2 Cp VC^P V^Q ^(1 ~Cp~Cq~ Dp) 

Let $\|P_D\|_2$ denote the 2-norm of the disjoint part of $P$ and $Q$, and $\|P_C\|_2$
the 2-norm of the common part of $P$ (or $Q$). We have $\|P - Q\|_2^2 = \|P_D\|_2^2 + \|Q_D\|_2^2$
and $\|P + Q\|_2 \le \|P_C + P_D\|_2$.

With high probabihty we have Dp < 2||Pd|||s < where we used 

concentration bounds and substituted s < JoJ\P^^^qI^ — stU^Hfp'- Similarly 
£'q < 11^^112/4 and Cp > H|M£, therefore we have 

Dls 2||Fcll9S 
Cp -16||Pc||i^-'/'" 

Substituting in (16) and using the fact that the last exponent is $o(1)$, we get the
desired result. ■



